Data on the uptake of hybrid open access are critical. Although policy recommendations have addressed open and transparent workflows in recent years, identifying hybrid open access articles and their funding sources remains challenging. In this post, we show how to mine such data from Elsevier. Between 2015 and July 2019, Elsevier’s subscription journals published 63,577 hybrid open access articles, representing 2.3% of the overall publication volume of these journals. The analysis reveals a growing uptake of agreements between Elsevier and funders to cover open access costs. Not surprisingly, mostly British and Dutch funders sponsor hybrid open access. The German Federal Ministry of Education and Research is also well represented, despite the current Elsevier boycott by most universities and research organizations in Germany. Nevertheless, the majority of funding sources are still unknown, raising important questions about the transparency of this publishing model.
In September 2018, cOAlition S, a group of international research funders, announced its widely discussed Plan S. According to its principles, publication fees that may arise when publishing open access should be covered by funders or research organizations. Although surveys (Solomon and Björk 2011; Dallmeier-Tiessen et al. 2011) suggest that many authors do not pay publication fees themselves, publishers rarely share such evidence. Nor do all funders and research organizations report their open access spending publicly, resulting in an opaque situation in which policy-making and analytics lack essential data about transitioning subscription journals to open access.
This blog post presents a dataset comprising publicly available hybrid open access sponsorship information from Elsevier, a major publisher of scholarly journals. The dataset, which was created using metadata from Crossref and by mining open access full-texts, serves as critical input to ongoing discussions around transitioning subscription journals to open access. The methods used to obtain the data address key challenges in discovering hybrid open access articles, along with funding and affiliation information, using open tools and services. Elsevier’s effort to make such data openly available also serves as a good practice example, potentially feeding into workflow guidelines for transformative agreements like the ESAC guidelines.
To demonstrate its potential, the dataset is used to present the number and proportion of hybrid open access articles in Elsevier journals. Drawing on Elsevier’s funding information, we also investigate whether publication fees were billed to authors or to funders that had made an agreement with Elsevier, or whether the fees were waived. Moreover, text-mined author email domains are presented as a rough approximation of the affiliation of the first or corresponding author, an important data point for delineating open access funding.
The resulting dataset is openly available on GitHub along with the source code.
As a start, the Elsevier publication fee price list, shared as a PDF document, was used to identify hybrid open access journals. The rOpenSci tabulizer package made it possible to extract data about these journals from this file.
Following the Hybrid Open Access Journal Dashboard, an interactive analytical application from SUB Göttingen, the Crossref REST API was queried to discover open access articles published in these journals, drawing on facet field counts along with the yearly article volumes for the period 2015 to 2019. After matching license URLs indicating open access articles, a second API call checked license metadata per journal. Here, the Crossref REST API filters license.url and license.delay made it possible to exclude delayed open access articles. For every immediate open access article, comprehensive Crossref metadata was obtained, including full-text links.
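The license check can be sketched as a plain URL builder for the Crossref `/works` route. This is a minimal illustration, not the code used to build the dataset: the helper name `crossref_oa_query`, the placeholder ISSN, and the use of `license.delay:0` as the cutoff for immediate open access are assumptions. The filter names (`issn`, `license.url`, `license.delay`, `from-pub-date`, `until-pub-date`) and the `published` facet are part of the Crossref REST API.

```r
# Hypothetical helper: build a Crossref /works query that counts articles
# in one journal carrying a given license without an embargo period.
crossref_oa_query <- function(issn, license_url,
                              from = "2015-01-01", until = "2019-12-31") {
  filters <- c(
    paste0("issn:", issn),
    paste0("license.url:", license_url),
    "license.delay:0",  # assumption: zero days of delay = immediate open access
    paste0("from-pub-date:", from),
    paste0("until-pub-date:", until)
  )
  paste0(
    "https://api.crossref.org/works?filter=",
    # Percent-encode the combined filter string so it is safe inside a URL
    utils::URLencode(paste(filters, collapse = ","), reserved = TRUE),
    # rows=0 asks for no records, only the yearly facet counts
    "&rows=0&facet=published:*"
  )
}

query_url <- crossref_oa_query(
  issn = "1475-1585",  # placeholder ISSN, for illustration only
  license_url = "http://creativecommons.org/licenses/by/4.0/"
)
print(query_url)
```

The resulting URL can be passed to any HTTP client. Requesting only facet counts keeps the response small when looping over many journals.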
Elsevier provides access to full-texts as HTML and XML documents via the Crossref Text and Data Mining Services (Crossref-TDM). Surprisingly, the XML representation contains not only the full-text but also comprehensive metadata, including the information about open access sponsorship shown below.
<openaccess>1</openaccess>
<openaccessArticle>true</openaccessArticle>
<openaccessType>Full</openaccessType>
<openArchiveArticle>false</openArchiveArticle>
<openaccessSponsorName>
Arts and Humanities Research Council
</openaccessSponsorName>
<openaccessSponsorType>FundingBody</openaccessSponsorType>
<openaccessUserLicense>
http://creativecommons.org/licenses/by/4.0/
</openaccessUserLicense>
Snapshot of open access metadata in an Elsevier XML full-text. https://api.elsevier.com/content/article/pii/S1475158518302261
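To illustrate how these fields can be read, here is a minimal sketch that pulls single elements out of such a fragment with base R regular expressions. It is an assumption-laden stand-in, not the crminer-based workflow used for the dataset: the helper `extract_tag` is hypothetical, and a real pipeline would more likely use a proper XML parser.

```r
# Fragment copied from the snapshot above
xml_txt <- "<openaccessType>Full</openaccessType>
<openaccessSponsorName>
Arts and Humanities Research Council
</openaccessSponsorName>
<openaccessSponsorType>FundingBody</openaccessSponsorType>"

# Hypothetical helper: return the trimmed text of the first <tag>...</tag>
# element, or NA if the element is absent.
extract_tag <- function(xml, tag) {
  # (?s) lets "." match newlines, since element content may span lines
  pattern <- sprintf("(?s)<%s>(.*?)</%s>", tag, tag)
  hit <- regmatches(xml, regexec(pattern, xml, perl = TRUE))[[1]]
  if (length(hit) < 2) return(NA_character_)
  trimws(hit[2])
}

sponsor <- c(
  name = extract_tag(xml_txt, "openaccessSponsorName"),
  type = extract_tag(xml_txt, "openaccessSponsorType")
)
print(sponsor)
```

Returning `NA` for missing elements keeps articles without sponsorship information in the dataset instead of dropping them silently.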
After interfacing the Elsevier full-texts with the crminer package, a client maintained by rOpenSci, the open access information highlighted above was extracted from the XML documents.
Moreover, the first author email address was parsed using pattern matching, assuming that email domains roughly indicate the affiliation of the first or corresponding author at the time of publication. Next, the email domains were split into their parts with urltools.
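The email step can be approximated in base R as follows. This is a sketch under stated assumptions: the regular expression, the helper names `extract_email_domain` and `split_domain`, and the example footnote are all hypothetical, and the naive split on dots stands in for urltools, which the post used for this step.

```r
# Hypothetical helper: find the first email address in a text snippet and
# return its domain in lower case, or NA if no address is found.
extract_email_domain <- function(text) {
  pattern <- "[[:alnum:]._%+-]+@[[:alnum:].-]+\\.[[:alpha:]]{2,}"
  hit <- regmatches(text, regexpr(pattern, text))
  if (length(hit) == 0) return(NA_character_)
  tolower(sub("^.*@", "", hit))
}

# Hypothetical helper: naive split into top-level domain and remaining labels.
# Multi-part suffixes such as "ac.uk" are NOT recognised here.
split_domain <- function(domain) {
  parts <- strsplit(domain, ".", fixed = TRUE)[[1]]
  list(tld = parts[length(parts)], labels = parts[-length(parts)])
}

# Made-up footnote text, for illustration only
footnote <- "Corresponding author. E-mail address: Jane.Doe@Example.ac.uk (J. Doe)."
domain <- extract_email_domain(footnote)
print(domain)
print(split_domain(domain))
```

The example shows the limit of the naive split: for `example.ac.uk` it reports `uk` as the top-level domain and treats `ac` as an ordinary label, which is the kind of case a suffix-aware tool is meant to handle.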
The resulting dataset comprises the following variables, and is openly shared via GitHub.
First ten rows
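# Load rmarkdown for paged_table()
library(rmarkdown)
# Read the normalized dataset from the repository's data folder
hybrid_df <- readr::read_csv("data/els_hybrid_info_normalized.csv")
# Display the first ten rows as a paged table
paged_table(head(hybrid_df, 10))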
It must be noted, however, that the open access information in Elsevier full-texts was not documented at the time of writing this blog post.
In total, the dataset comprises 63,577 hybrid open access articles from 1,703 hybrid open access journals published between January 2015 and July 2019.
Using this dataset, the share of hybrid open access articles per journal was calculated. To explore variations among journals, Bob Rudis’s ggeconodist package was used. The package does a great job of replicating the boxplot aesthetics of The Economist magazine.
The figure shows a slow but steady hybrid open access uptake. The median open access proportion was around 3% in the first seven months of 2019. Of the 1,985 Elsevier subscription journals offering hybrid open access, 1,703 did in fact publish at least one article under this model, corresponding to a share of 86%.
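The per-journal share and the yearly median can be computed along the following lines. This is a minimal sketch with invented toy counts; the column names `oa_articles` and `all_articles` are assumptions and do not necessarily match the published dataset. The last lines re-derive the 86% from the journal counts given above.

```r
# Toy counts per journal and year (invented numbers, for illustration only)
counts <- data.frame(
  journal      = c("A", "B", "C", "A", "B", "C"),
  year         = c(2018, 2018, 2018, 2019, 2019, 2019),
  oa_articles  = c(2, 0, 5, 3, 0, 6),
  all_articles = c(100, 50, 100, 100, 50, 100)
)

# Share of hybrid open access articles per journal and year
counts$oa_share <- counts$oa_articles / counts$all_articles

# Median share across journals, by year
median_share <- vapply(split(counts$oa_share, counts$year), median, numeric(1))
print(median_share)

# Proportion of journals with at least one hybrid open access article
oa_per_journal <- tapply(counts$oa_articles, counts$journal, sum)
active_share <- mean(oa_per_journal > 0)
print(active_share)

# Same proportion for the figures reported in the text
reported_share <- 1703 / 1985
print(round(100 * reported_share))
```

Using the median rather than the mean keeps a handful of journals with very high open access shares from dominating the yearly summary.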
Elsevier usually requires authors to pay a publication fee, also known as an article processing charge (APC), to publish open access. Many authors make use of funding from grant agencies or academic institutions to cover such fees. To streamline this process, some funding bodies and institutions have agreed on central payment options for affiliated researchers. Elsevier also provides APC waivers.
In most cases (59%), payment notifications were sent directly to the authors. Elsevier lists a funding body covering the open access publication fee for around one third of articles.
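Such a breakdown amounts to tabulating the sponsor type field. The sketch below uses an invented toy vector: only the value `FundingBody` appears in the XML snapshot, while `Author` and `Waived` are hypothetical placeholder labels, and the resulting percentages are not the figures reported above.

```r
# Toy sponsor types (invented; "Author" and "Waived" are placeholder labels)
sponsor_type <- c("Author", "Author", "Author",
                  "FundingBody", "FundingBody", "Waived")

# Percentage of articles per sponsor type, rounded to one decimal place
shares <- round(100 * prop.table(table(sponsor_type)), 1)
print(shares)
```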
The following interactive visualization lets you browse the funders disclosed by Elsevier.
Mostly British and Dutch funders sponsored hybrid open access in Elsevier journals. The German Federal Ministry of Education and Research (BMBF) is also well represented, despite the current boycott by most universities and research organizations in Germany. According to the publisher, the BMBF has financially supported 152 hybrid open access articles in 110 Elsevier journals since 2018.
In addition to funding information, email domains were parsed from Elsevier full-texts. These domains roughly indicate the affiliation of the first or corresponding author, a data point used to delineate open access funding. In the following, a hierarchical, interactive treemap visualizes the distribution of the email domains. Each top-level domain can be subdivided further into domain names representing academic institutions or companies. The size of each rectangle is proportional to the number of hybrid open access articles corresponding to this domain.
Dallmeier-Tiessen, Suenje, Robert Darby, Bettina Goerner, Jenni Hyppoelae, Peter Igo-Kemenes, Deborah Kahn, Simon C. Lambert, et al. 2011. “Highlights from the SOAP Project Survey. What Scientists Think About Open Access Publishing.” http://arxiv.org/abs/1101.5260.
Solomon, David J., and Bo-Christer Björk. 2011. “Publication Fees in Open Access Publishing: Sources of Funding and Factors Influencing Choice of Journal.” Journal of the Association for Information Science and Technology 63 (1). Wiley-Blackwell: 98–107. https://doi.org/10.1002/asi.21660.